| carat | cut | color | clarity | depth (%) | table (%) | price (USD) | x (mm) | y (mm) | z (mm) |
|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
| 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
| 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
| 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
| 0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
| 0.24 | Very Good | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 |
| 0.26 | Very Good | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 |
| 0.22 | Fair | E | VS2 | 65.1 | 61 | 337 | 3.87 | 3.78 | 2.49 |
| 0.23 | Very Good | H | VS1 | 59.4 | 61 | 338 | 4.00 | 4.05 | 2.39 |
## # A tibble: 12,777 x 26
## idnum state state2 stfips zip region typebldg floor room basement windoor
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <chr> <chr>
## 1 1 AZ AZ 4 85920 1 1 1 2 N <NA>
## 2 2 AZ AZ 4 85920 1 0 9 0 <NA> <NA>
## 3 3 AZ AZ 4 85924 1 1 1 3 N <NA>
## 4 4 AZ AZ 4 85925 1 1 1 3 N <NA>
## 5 5 AZ AZ 4 85932 1 1 1 1 N <NA>
## 6 6 AZ AZ 4 85936 1 1 1 1 N <NA>
## 7 7 AZ AZ 4 85936 1 0 9 0 <NA> <NA>
## 8 8 AZ AZ 4 85936 1 1 1 1 N <NA>
## 9 9 AZ AZ 4 85938 1 0 9 0 <NA> <NA>
## 10 10 AZ AZ 4 85938 1 1 1 3 <NA> <NA>
## # … with 12,767 more rows, and 15 more variables: rep <chr>, stratum <chr>,
## # wave <dbl>, starttm <chr>, stoptm <chr>, startdt <chr>, stopdt <chr>,
## # activity <dbl>, pcterr <dbl>, adjwt <dbl>, dupflag <chr>, zipflag <chr>,
## # cntyfips <chr>, county <chr>, log_radon <dbl>
Observations are drawn from a Normal distribution
\[ y \sim \mathcal{N}(\mu, \sigma) \]
Predictors determine the mean, \(\mu\)
Choose a functional form for this relationship
\[ f(x) \to \mathbf{X} \boldsymbol{\beta} \to \mu \]
Each point drawn from an individual distribution
\[ y_i \sim \mathcal{N}\left(\sum \beta_j x_{ji}, \, \sigma\right) \]
Model fit \(\rightarrow\) determine values for \(\beta\)
How do we determine \(\beta\)?
Calculate \(\mu\) from data
Calculate log-likelihood of measurement, \(\mathcal{L}(y \, | \, \mu, \sigma)\)
Sum over all datapoints
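The three steps above can be sketched in base R. The data and parameter values here are made up for illustration:

```r
# Hypothetical data and parameters (illustrative values only)
x <- c(0.5, 1.0, 1.5)
y <- c(1.1, 2.0, 2.9)
beta <- 2
sigma <- 0.2

mu <- beta * x                                          # 1. calculate mu from data
ll_each <- dnorm(y, mean = mu, sd = sigma, log = TRUE)  # 2. log-likelihood per point
ll_total <- sum(ll_each)                                # 3. sum over all datapoints
ll_total
```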
Parameter uncertainty
Prior Knowledge
\(+\)
Data
\(=\)
Posterior Knowledge
Parameters, \(\theta\)
Data, \(D\)
Prior: \(p(\theta)\)
Likelihood: \(p(D | \theta)\)
Posterior: \(p(\theta | D)\)
\[ p(\theta \, | \, D) = \frac{p(\theta) \, p(D \, | \, \theta)}{\int p(\theta) \, p(D \, | \, \theta) \, d\theta} \]
Normalizing the posterior requires a high-dimensional integral over \(\theta\)
Use MCMC to draw samples from the posterior instead
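The idea can be sketched with a minimal random-walk Metropolis sampler in base R (illustrative only; Stan uses the more sophisticated NUTS/HMC sampler). The data and priors here are hypothetical: \(y_i \sim \mathcal{N}(\theta, 1)\) with a \(\mathcal{N}(0, 10)\) prior on \(\theta\):

```r
set.seed(42)
y <- c(1.2, 0.8, 1.1, 0.9, 1.0)

log_post <- function(theta) {
  sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +  # log-likelihood
    dnorm(theta, mean = 0, sd = 10, log = TRUE)      # log-prior
}

draws <- numeric(5000)
theta <- 0
for (i in seq_along(draws)) {
  proposal <- theta + rnorm(1, sd = 0.5)             # random-walk proposal
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  draws[i] <- theta
}
mean(draws)  # close to the conjugate posterior mean, sum(y) / (5 + 1/100)
```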
Probabilistic Programming Language
CmdStan, PyStan, rstan
rstanarm and brms
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
How do we use these posterior draws?
Negative prices?
Use logs
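Why logs help: a Normal model on price can predict negative prices, but a Normal model on log(price) back-transforms through `exp()` to strictly positive prices. The posterior predictive draws below are hypothetical numbers for illustration:

```r
set.seed(7)
pred_log_price <- rnorm(1000, mean = 6, sd = 1)  # draws on the log scale
pred_price <- exp(pred_log_price)                # back-transform to dollars
all(pred_price > 0)                              # always TRUE
```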
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
A word of caution…
Predicting log(price)
What about price?
Induce a uniform distribution?
What about small data?
How do we use structure?
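One hedged sketch (not run here, since sampling takes a while): in rstanarm, county structure enters through an lme4-style varying-intercept term, assuming a `radon` data frame with the columns shown above.

```r
library(rstanarm)

# Varying intercept by county, common slope for floor
fit <- stan_lmer(log_radon ~ floor + (1 | county), data = radon)
```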
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
The most extreme estimates come from the smallest samples
What about partial pooling?
\[\begin{eqnarray*} \text{Larger samples} &\to& \text{individual estimates} \\ \text{Smaller samples} &\to& \text{grouped estimates} \end{eqnarray*}\]
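This shrinkage behaves like a precision-weighted average, sketched below with made-up numbers: a county's estimate is pulled toward the overall mean, and the pull is stronger when the county has fewer observations.

```r
grand_mean  <- 1.3   # overall mean (complete pooling)
county_mean <- 0.5   # this county's raw mean (no pooling)
sigma_y <- 0.8       # within-county sd
sigma_a <- 0.3       # between-county sd

partial_pool <- function(n) {
  w <- (n / sigma_y^2) / (n / sigma_y^2 + 1 / sigma_a^2)  # weight on county data
  w * county_mean + (1 - w) * grand_mean
}

partial_pool(2)    # small sample: estimate sits near the grand mean
partial_pool(100)  # large sample: estimate sits near the county mean
```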
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
##
## Energy:
## E-BFMI indicated no pathological behavior.
Compiles to C++
Documentation improving rapidly
## data {
##   int<lower=1> N;                    // number of observations
##   int<lower=1> J;                    // number of counties
##   int<lower=1,upper=J> county[N];    // county index for each observation
##   vector[N] x;                       // predictor
##   vector[N] y;                       // outcome
## }
## parameters {
##   vector[J] a;                       // per-county intercepts
##   real beta;                         // common slope
##   real<lower=0> sigma_a;             // between-county sd
##   real<lower=0> sigma_y;             // within-county sd
##   real mu_a;                         // mean of the county intercepts
## }
## model {
##   vector[N] y_hat;
##   for (i in 1:N)
##     y_hat[i] = beta * x[i] + a[county[i]];
##
##   beta ~ normal(0, 1);
##   mu_a ~ normal(0, 1);
##   sigma_a ~ cauchy(0, 2.5);
##   sigma_y ~ cauchy(0, 2.5);
##
##   a ~ normal(mu_a, sigma_a);         // partial pooling of intercepts
##   y ~ normal(y_hat, sigma_y);        // likelihood
## }
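A hedged sketch of fitting the program above from R with rstan (not run here; assumes the program is saved as `radon.stan` and a `radon` data frame with the columns shown earlier):

```r
library(rstan)

stan_data <- list(
  N = nrow(radon),
  J = length(unique(radon$county)),
  county = as.integer(factor(radon$county)),  # 1..J indices
  x = radon$floor,
  y = radon$log_radon
)

fit <- stan(file = "radon.stan", data = stan_data, chains = 4, iter = 2000)
```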
Flexible
Censored, truncated data
Generative modelling
Not a magic bullet
Implementation more complex (by design)
Ecosystem developing rapidly
PyStan and arviz
rstanarm, tidybayes, bayesplot, shinystan
Email:
GitHub: